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Abstract 

Large-scale neuroimaging studies have been collecting brain images of study individ- 
uals, which take the form of two-dimensional, three-dimensional, or higher dimensional 
arrays, also known as tensors. Addressing scientific questions arising from such data 
demands new regression models that take multidimensional arrays as covariates. Sim- 
ply turning an image array into a long vector causes extremely high dimensionality 
that compromises classical regression methods, and, more seriously, destroys the in- 
herent spatial structure of array data that possesses wealth of information. In this 
article, we propose a family of generalized linear tensor regression models based upon 
the Tucker decomposition of regression coefficient arrays. Effectively exploiting the low 
rank structure of tensor covariates brings the ultrahigh dimensionality to a manageable 
level that leads to efficient estimation. We demonstrate, both numerically that the new 
model could provide a sound recovery of even high rank signals, and asymptotically 
that the model is consistently estimating the best Tucker structure approximation to 
the full array model in the sense of Kullback-Liebler distance. The new model is also 

_J compared to a recently proposed tensor regression model that relies upon an alternative 

O CANDECOMP/PARAFAC (CP) decomposition. 
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a 1 Introduction 

Advancing technologies are constantly producing large scale scientific data with complex 
structures. An important class arises from medical imaging, where the data takes the form of 
multidimensional array, also known as tensor. Notable examples include electroencephalogra- 
phy (EEG, 2D matrix), anatomical magnetic resonance images (MRI, 3D array), functional 
magnetic resonance images (fMRI, 4D array), among other image modalities. In medical 
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imaging data analysis, a primary goal is to better understand associations between brains 
and clinical outcomes. Applications include using brain images to diagnose neurodegenera- 
tive disorders, to predict onset of neuropsychiatric diseases, and to identify disease relevant 
brain regions or activity patterns. This family of problems can collectively be formulated 
as a regression with clinical outcome as response, and image, or tensor, as predictor. How- 
ever, the sheer size and complex structure of image covariate pose unusual challenges, which 
motivate us to develop a new class of regression models with image covariate. 

Most classical regression models take vector as covariate. Naively turning an image array 
into a vector is evidently unsatisfactory. For instance, a typical MRI image of size 128-by- 
128-by-128 implicitly requires 128 3 = 2, 097, 152 regression parameters. Both computability 
and theoretical guarantee of the classical regression models are severely compromised by 
this ultra-high dimensionality. More seriously, vectorizing an array destroys the inherent 
spatial structure of the image array that usually possesses abundant information. A typical 
solution in the literature first employs the subject knowledge to extract a vector of features 



from images, and then feeds the feature vector into a classical regression model (Mckeown 



et~aTj [I998| |Blankertz et~aT| [200T| |Haxby et aT| [200T| [Kontos et al.[ [20031 pitchell et al. 



2004 LaConte et al. , 2005 Shinkareva et al. , 2006). Alternatively one first applies unsu- 



pervised dimension reduction, often some variant of principal components analysis, to the 



image array, and then fits a regression model in the reduced dimensional vector space ( Caffo 



et al. 2010). Both solutions are intuitive and popular, and have enjoyed varying degrees 



of success. At heart, both transform the problem to a classical vector covariate regression. 
However, there is no consensus on what choice best summarizes a brain image even for a 
single modality, whereas unsupervised dimension reduction like principal components could 
result in information loss in a regression setup. In contrast to constructing an image fea- 
ture vector, the functional approach views image as a function and then employs functional 



regression models (Ramsay and Silverman, 2005). Reiss and Ogden (2010) notably applied 



this idea to regression with 2D image predictor. Extending their method to 3D and higher 
dimensional images, however, is far from trivial and requires substantial research, due to the 
large number of parameters and multi-collinearity among imaging measures. 



In a recent work, Zhou et al. (2013) proposed a class of generalized linear tensor regression 



models. Specifically, for a response variable Y, a vector predictor Z G IR Po and a D- 



dimensional tensor predictor X G jrpi x --- x pd ^ the response is assumed to belong to an 
exponential family where the linear systematic part is of the form, 

g(fi) = YZ+(B,X). (1) 

Here g(-) is a strictly increasing link function, \x = E(Y\X,Z), 7 G IR Po is the regu- 
lar regression coefficient vector, B G ]RPi x - x Pn j s the coefficient array that captures the 
effects of tensor covariate X, and the inner product between two arrays is defined as 
(B,X) = (vecB,vecX) = ^2i i Pii...i D x h-iD- This model, if with no further simplifi- 
cation, is prohibitive given its gigantic dimensionality: po + Y\a=i Vd- Motivated by a com- 



monly used tensor decomposition, Zhou et al. (2013) introduced a low rank structure on the 



coefficient array B. That is, B is assumed to follow a rank-i? CANDECOMP/PARAFAC 



(CP) decomposition (Kolda and Bader, 2009) 
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where (3 d G JR Pd are all column vectors, d = 1, . . . , D,r = 1, . . . , R, and o denotes an outer 
product among vectors. Here the outer product 61 o 6 2 ° ■ ■ ■ ° bp of D vectors bd G lR Pd , 
d = 1, . . . , D, is defined as the p\ x • • • X pr> array with entries (61 o 6 2 o • • • o bD)i v -i D = 
EL=i^i d - For convenience, this CP decomposition is often represented by a shorthand 
B = lBi,...,B D }, where B d = [$\ . . . ,/3^ } ] G !R^ xi? , d = 1,...,D. Combining % 
and ([2]) yields generalized linear tensor regression models of Zhou et al. (2013), where the 
dimensionality decreases to the scale of po + R x J2d=iP<i- Under this setup, ultrahigh 
dimensionality of (fil) is reduced to a manageable level, which in turn results in efficient 
estimation and prediction. For instance, for a regression with 128-by-128-by-128 MRI image 
and 5 usual covariates, the dimensionality is reduced from the order of 2, 097, 157 = 5 + 128 3 
to 389 = 5 + 128 x 3 for a rank-1 model, and to 1, 157 = 5 + 3 x 128 x 3 for a rank-3 model. 



Zhou et al. (2013) showed that this low rank tensor model could provide a sound recovery 



of even high rank signals. 

In the tensor literature, there has been an important development parallel to CP decom- 
position, which is called Tucker decomposition, or higher-order singular value decomposition 



(HOSVD) (Kolda and Bader, 2009). In this article, we propose a class of Tucker tensor 



regression models. To differentiate, we call the models of Zhou et al. (2013) CP tensor 



regression models. Specifically, we continue to adopt the model (py), but assume that the 
coefficient array B follows a Tucker decomposition, 

Ri Rd 

B = J^---^9r 1 ,...,r D (3 i { l) o...o(3^\ (3) 

r 1= l r D =l 

where (3 J G IR Pd are all column vectors, d = 1, . . . ,D,rd = l,...,i?d, and g ri ,...,r D are 
constants. It is often abbreviated as -B = [G; Bi, . . . ,B D J, where G G ]R r i x - xR d j s a 
/^-dimensional core tensor with entries (G) ri ... r£) = g ri ,...,r D , and -B^ G IR^*-^ are the factor 
matrices. BJs are usually orthogonal and can be thought of as the principal components in 
each dimension (and thus the name, HOSVD). The number of parameters of a Tucker tensor 
model is in the order of po + Yld=i ^ d x Pd- Comparing the two decompositions (b| and (J3J), 
the key difference is that CP fixes the number of basis vectors R along each dimension of B 
so that all BJs have the same number of columns (ranks). In contrast, Tucker allows the 
number R<i to differ along different dimensions and B^'s could have different ranks. 

This difference between the two decompositions seems minor; however, in the context of 
tensor regression modeling and neuroimging analysis, it has profound implications, and such 
implications motivate this article. On one hand, the Tucker tensor regression model shares 
the advantages of the CP tensor regression model, in that it effectively exploits the special 
structure of the tensor data, it substantially reduces the dimensionality to enable efficient 
model estimation, and it provides a sound low rank approximation to a potentially high rank 
signal. On the other hand, Tucker tensor regression offers a much more flexible modeling 
framework than CP regression, as it allows distinct order along each dimension. When the 
orders are all identical, it includes the CP model as a special case. This flexibility leads to 
several improvements that are particularly useful for neuroimaging analysis. First, a Tucker 
model could be more parsimonious than a CP model thanks to the flexibility of different 
orders. For instance, suppose a 3D signal B G ]R 16xl6x16 admits a Tucker decomposition 
PJ with Ri = R2 = 2 and -R3 = 5. It can only be recovered by a CP decomposition with 
R = 5, costing 230 parameters. In contrast, the Tucker model is more parsimonious with only 
131 parameters. This reduction of free parameters is valuable for medical imaging studies, as 
the number of subjects is often limited. Second, the freedom in the choice of different orders 
is useful when the tensor data is skewed in dimensions, which is common in neuroimaging 
data. For instance, in EEG, the two dimensions consist of electrodes (channels) and time, 



and the number of sampling time points usually far exceeds the number of channels. Third, 
even when all tensor modes have comparable sizes, the Tucker formulation explicitly models 
the interactions between factor matrices B^s, and as such allows a finer grid search within a 
larger model space, which in turn may explain more trait variance. Finally, as we will show 
in Section 2.3, there exists a duality regarding the Tucker tensor model. Thanks to this 
duality, a Tucker tensor decomposition naturally lends itself to a principled way of imaging 
data downsizing, which, given the often limited sample size, again plays a practically very 
useful role in neuroimaging analysis. 

For these reasons, we feel it important to develop a complete methodology of Tucker ten- 
sor regression and its associated theory. The resulting Tucker tensor model carries a number 
of useful features. It performs dimension reduction through low rank tensor decomposition 
but in a supervised fashion, and as such avoids potential information loss in regression. It 
works for general array-valued image modalities and/or any combination of them, and for 
various types of responses, including continuous, binary, and count data. Besides, an efficient 
and highly scalable algorithm has been developed for the associated maximum likelihood es- 
timation. This scalability is important considering the massive scale of imaging data. In 
addition, regularization has been studied in conjunction with the proposed model, yielding a 
collection of regularized Tucker tensor models, and particularly one that encourages sparsity 
of the core tensor to facilitate model selection among the defined Tucker model space. 

Recently there have been some increasing interests in matrix/tensor decomposition and 
their applications in brain imaging studies (Crainiceanu et al. 2011[ [Allen et al." 2011 Hoff 



2011 Aston and Kirch, 2012). Nevertheless, this article is distinct in that we concentrate 



on a regression framework with scalar response and tensor valued covariates. In contrast, 



Crainiceanu et al. (2011) and Allen et al. (2011) studied unsupervised decomposition, Hoff 



(2011) considered model-based decomposition, whereas Aston and Kirch (2012) focused on 



change point distribution estimation. The most closely related work to this article is Zhou 



et al. (2013); however, we feel our work is not a simple extension of theirs. First of all, 



considering the complex nature of tensor, the development of the Tucker model estimation 



as well as its asymptotics is far from a trivial extension of the CP model of Zhou et al. 



(2013). Moreover, we offer a detailed comparison, both analytically (in Section 2.4) and 



numerically (in Sections 6.3 and 6.4), of the CP and Tucker decompositions in the context 



of regression with imaging/tensor covariates. We believe this comparison is crucial for an 
adequate comprehension of tensor regression models and supervised tensor decomposition in 
general. 

The rest of the article is organized as follows. Section [2] begins with a brief review of 
some preliminaries on tensor, and then presents the Tucker tensor regression model. Section 
[3] develops an efficient algorithm for maximum likelihood estimation. Section [4] derives 
inferential tools such as score, Fisher information, identifiability, consistency, and asymptotic 
normality. Section [5] investigates regularization method for the Tucker regression. Section [6] 
presents extensive numerical results. Section [7] concludes with some discussions and points 
to future extensions. All technical proofs are delegated to the Appendix. 

2 Model 

2.1 Preliminaries 

We start with a brief review of some matrix/array operations and results. Extensive refer- 



ences can be found in the survey paper (Kolda and Bader, 2009). 



A tensor is a multidimensional array. Fibers of a tensor are the higher order analogue of 
matrix rows and columns. A fiber is defined by fixing every index but one. A matrix column 
is a mode-1 fiber and a matrix row is a mode-2 fiber. Third-order tensors have column, row, 
and tube fibers, respectively. We next review some important operators that transform a 
tensor into a vector /matrix. The vec operator stacks the entries of a D- dimensional tensor 
B G ]Rp jX " x po into a column vector. Specifically, an entry bi x ,„i D maps to the j-th entry of 
vec I? where j = 1 + ^2 d=1 (id — 1) Yld~=iPd'- For instance, when D = 2, the matrix entry at 
cell (ii, i 2 ) maps to position j = 1 + i L — 1 + (i 2 — l)pi — %\ + (i 2 — l)pi, which is consistent 
with the more familiar vec operator on a matrix. The mode-d matricization, B(d), maps a 
tensor B into a pdX Yld'^dPd.' matrix such that the (zi, . . . , Id) element of the array B maps 
to the [id-,]) element of the matrix B( d ), where j = 1 + Y^d'^dtid' ~ 1) Tld"<d',d"^dPd"- When 
D = 1, we observe that vecB is the same as vectorizing the mode-1 matricization Bm. 
The mode-(d,d') matricization Bud?) G ']B? dPd '' K '^ d "^ d < d ' Pd " is defined in a similar fashion. We 
then define the mode-d multiplication of the tensor B with a matrix U G lR PdXq , denoted by 
B x d U G ]RPi x - x 'i x - x 'PD^ as th e multiplication of the mode-c? fibers of B by U. In other 



words, the mode-<i matricization of B x d U is UB^ d y 

We also review two properties of a tensor B that admits a Tucker decomposition ^. 
The mode-d matricization of B can be expresses as 

B {d) = B d G {d) (B D ® • • • ® # d+1 (8) B d ^ <g> • • ■ <g> Bi) T , 

where ® denotes the Kronecker product of matrices. If applying the vec operator to B, then 

vecS = veaB(i) = vec(SiG ( i ) (B jD <g> • • • ® £ 2 ) T ) = (Bd <g) • • • <8> B^vecG. 

These two properties are useful for our subsequent Tucker regression development. 

2.2 Tucker Regression Model 

We elaborate on the Tucker tensor regression model introduced in Section 1. We assume that 



Y belongs to an exponential family with probability mass function or density (McCullagh 



andNelder, 1983), 



p(Vi\0i, 4>) = exp <^ l l ,,, + c{vi, 4>) 

with the first two moments E(Yi) = \x% = b'(9i) and Var(yi) = of = b" (^)oj ((f)). 9 and > 
are, respectively, called the natural and dispersion parameters. We assume the systematic 
part of GLM is of the form 

Hi Rd 

gfr) = V = YZ+(J^-.-J^ g ri ^r D ^ l] o • • • o (3% D \ X). (4) 

n=l r D =l 

That is, we impose a Tucker structure on the array coefficient B. We make a few remarks. 
First, in this article, we consider the problem of estimating the core tensor G and factor 
matrices B d simultaneously given the response Y and covariates X and Z. This can be 
viewed as a supervised version of the classical unsupervised Tucker decomposition. It is 
also a supervised version of principal components analysis for higher-order multidimensional 
array. Unlike a two-stage solution that first performs principal components analysis and then 
fits a regression model, the basis (principal components) B d in our models are estimated 



under the guidance (supervision) of the response variable. Second, the CP model of Zhou 
et al. (2013) corresponds to a special case of the Tucker model ^ with g ri ,...,r D = l{ri=-=rr>} 



and R\ — . . . — Rd = R- In other words, the CP model is a specific Tucker model with 



a super-diagonal core tensor G. The CP model has a rank at most R while the general 
Tucker model can have a rank as high as R D . We will further compare the two model sizes 
in Section 12^41 

2.3 Duality and Tensor Basis Pursuit 

Next we investigate a duality regarding the inner product between a general tensor and a 
tensor that admits a Tucker decomposition. 

Lemma 1 (Duality). Suppose a tensor B e ]Rpi x -- x pd a d m it s Tucker decomposition B = 
{G;B 1 ,...,B D }. Then, for any tensor X e W lX '" x " D > (B,X) = (G,X), where X admits 
a Tucker decomposition X = [X; B\, . . . , BJjJ . 

This duality gives some important insights to the Tucker tensor regression model. First, if 
we consider B d e ^Vd^Rd as fixed and known basis matrices, then Lemma [l] says fitting 
the Tucker tensor regression model Q is equivalent to fitting a tensor regression model in 
G with the transformed data X = {X; B\, . . . , B T D \ € 1R RiX " xRd . When R d < p d , the 
transformed data X effectively downsize the original data. We will further illustrate this 



downsizing feature in the real data analysis in Section |6.4| Second, in applications where 
the numbers of basis vectors R d are unknown, we can utilize possibly over-complete basis 
matrices B d such that R d > p d , and then estimate G with sparsity regularizations. This 



leads to a tensor version of the classical basis pursuit problem (Chen et al. , 2001). Take 
fMRI data as an example. We can adopt the wavelet basis for the three image dimensions 
and the Fourier basis for the time dimension. Regularization on G can be achieved by either 
imposing a low rank decomposition (CP or Tucker) on G (hard thresholding) or penalized 
regression (soft thresholding). We will investigate Tucker regression regularization in details 
in Section HI 

2.4 Model Size: Tucker vs CP 

In this section we investigate the size of the Tucker tensor model. Comparison with the 
size of the CP tensor model helps gain better understanding of both models. In addition, it 
provides a base for data adaptive selection of appropriate orders in a Tucker model. 

First we quickly review the number of free parameters pc for a CP model B = [i?i, . . . , B d \ 
with B d e JR PdXR . For D = 2, p c = R( Pl +p 2 )-R 2 , and for D>2,p c = R(J2d=iPd~ D + 1 )- 



Table 1: Number of free parameters in Tucker and CP models. 



CP 



Tucker 



D = 2 
D>2 



R(pi + V2) - R 2 P1R1 + P2R2 + R1R2 -R\-Rl 
R(E d p d -D + i) E d p d R d + U d R d -E d R 2 d 



For D = 2, the term — R 2 adjusts for the nonsingular transformation indeterminacy for model 
identifiability; for D > 2, the term R(—D + 1) adjusts for the scaling indeterminacy in the 



CP decomposition. See Zhou et al. (2013) for more details. Following similar arguments, we 
obtain that the number of free parameters px in a Tucker model B = [G; B 1 , . . . , B d J, with 
G E U RlX '"* Ri and B d e lR PdXRd , is 

D D D 

P T = j2pd R d+Il R d-J2 R l 

d =i d =i d =i 

for any D. Here the term ~Yl ld=1 R d adjusts for the non-singular transformation indetermi- 
nancy in the Tucker decomposition. We summarize these results in Table [TJ 

Next we compare the two model sizes (degrees of freedom) under an additional assumption 
that R\ = ■ ■ ■ = R d = R. The difference becomes: 



Pt ~Pc 



(0 

R(R-l)(R-2) 

R(R 3 -4R + 3) 
^R(R D - 1 -DR + D-l) 



when D = 2, 
when D = 3, 
when D = 4, 
when D > 4. 



Based on this formula, when D = 2, the Tucker model is essentially the same as the CP 
model. When D — 3, Tucker has the same number of parameters as CP for R = 1 or 
R = 2, but costs R(R — 1)(R — 2) more parameters for R > 2. When D > 3, Tucker and 
CP are the same for R = 1, but Tucker costs substantially more parameters than CP for 
R > 2. For instance, when D = 4 and R — 3, Tucker model takes 54 more parameters 
than the CP model. However, one should bear in mind that the above discussion assumes 
Ri = ■ ■ ■ = R d = R. In reality, Tucker could require less free parameters than CP, as 
shown in the illustrative example given in Section 1, since Tucker is more flexible and allows 
different order R^ along each dimension. 

Figure [T] shows an example with D = 3 dimensional array covariates. Half of the true 
signal (brain activity map) B is displayed in the left panel, which is by no means a low rank 
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Figure 1: Left: half of the true signal array B. Right: Deviances of CP regression estimates 
at R — 1, . . . , 5, and Tucker regression estimates at orders (i?i, R 2 , R3) = (1, 1, 1), (2, 2, 2), 
(3,3,3), (4,4,3), (4,4,4), (5,4,4), (5,5,4), and (5,5,5). The sample size is n = 1000. 



signal. Suppose 3D images Xi are taken on n = 1,000 subjects. We simulate image traits 
Xi from independent standard normals and quantitative traits Yi from independent normals 
with mean (Xi, B) and unit variance. Given the limited sample size, the hope is to infer a 
reasonable low rank approximation to the activity map from the 3D image covariates. The 
right panel displays the model deviance versus the degrees of freedom of a series of CP and 
Tucker model estimates. The CP model is estimated at ranks R = 1,. . . ,5. The Tucker 
model is fitted at orders [R 1 ,R 2 ,R 3 ) = (1,1,1), (2,2,2), (3,3,3), (4,4,3), (4,4,4), (5,4,4), 
(5, 5, 4), and (5, 5, 5). We see from the plot that, under the same number of free parameters, 
the Tucker model could generally achieve a better model fit with a smaller deviance. (Note 
that the deviance is in the log scale, so a small discrepancy between the two lines translates 
to a large value of difference in deviance.) 

The explicit model size formula of the Tucker model is also useful for choosing appropriate 
orders Rd's along each direction given data. This can be treated as a model selection problem, 
and we can employ a typical model selection criterion, e.g., Bayesian information criterion 
(BIC). It is of the form: — 21og£ + \og(n)p e , where £ is the log-likelihood, and p e = p^ is the 
effective number of parameters of the Tucker model as given in Table [TJ We will illustrate 



this BIC criterion in the numerical Section 6.1, and will discuss some heuristic guidelines of 



selecting orders in Section 6.4 
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3 Estimation 

We pursue the maximum likelihood estimation (MLE) for the Tucker tensor regression model 
and develop a scalable estimation algorithm in this section. The key observation is that, 
although the systematic part Q is not linear in G and B d jointly, it is linear in them 
separately. This naturally suggests a block relaxation algorithm, which updates each factor 
matrix Bd and the core tensor G alternately. 

The algorithm consists of two core steps. First, when updating Bd £ ]RP dXRd with the 
rest -Brf/'s and G fixed , we rewrite the array inner product in Q as 

(B,X) = (B {d) ,X (d )) 

= (B d G id) {B D <g> • • • ® B d+1 ® B d _ x ® • • • ® SO 7 , X (d) ) 
= (B d , X (d) (B D ® • • • <g> B d+ i <8> B d _i g> • • • <g> Bi)G[ d) ). 

Then the problem turns into a GLM regression with B^ as the "parameter" and the term 
X(d)(Bu<8>- • -<g)B d+ i(g)B d _iCg>- ■ ■®Bi)G J , d) as the "predictor". It is a low dimensional GLM 
with only pdRd parameters and thus is easy to solve. Second, when updating G £ lR Rl x '" xRd 
with all Bo's fixed, 

(B,X) = (vecB,vecX) 

= ((B D g> • • • ® Bi)vecG, vecX) 
= (vecG, (B D ® •••®Bi) T veaX). 

This implies a GLM regression with vecG as the "parameter" and the term (B# <g> • ■ ■ <8> 
Bi) T vecX as the "predictor" . Again this is a low dimensional regression problem with FT d R d 
parameters. For completeness, we summarize the above alternating estimation procedure in 
Algorithm [TJ The orthogonality between the columns of factor matrices Bd is not enforced 
as in unsupervised HOSVD, because our primary goal is approximating tensor signal instead 
of finding the principal components along each mode. 

Next we study the convergence properties of the proposed algorithm. As the block 
relaxation algorithm monotonically increases the objective value, the stopping criterion is 
well-defined and the convergence properties of iterates follow from the standard theory for 



monotone algorithms (de Leeuw, 1994 Lange, 2010). The proof of next result is given in 
the Appendix. 
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Algorithm 1 Block relaxation algorithm for fitting the Tucker tensor regression. 

Initialize: 7^ = argmax-y £(7, 0, . . . , 0), B d G WL PdXRi a random matrix for d = 1, . . . , D, 

and G^ G ]R r i x - xR d a random matrix. 

repeat 

for d = 1, ... ,D do 

Bf *> = argmax Bd *( 7 <*>, B [ t+1 \ . . . , b£?, B d , B d % ...,B» G®) 
end for 

G (*+D = argmaXG £( 7 W, B ? +1) , . . . , #g +1) , G) 
7 (t+i) = argmaX7 £( 7? S j*+ 1 ) ; . . . ; B g+!) 5 G (*+i)) 

until £(0 (t+1) )-£(0 W )<e 



Proposition 1. Assume (i) the log-likelihood function £ is continuous, coercive, i.e., the 
set {6 : £{6) > £(6^ ')} is compact, and bounded above, (ii) the objective function in each 
block update of AlgorithmUlis strictly concave, and (Hi) the set of stationary points (modulo 
nonsingular transformation indeterminacy) of £(7, G,Bi, . . . , Bd) are isolated. We have 
the following results. 

1. (Global Convergence) The sequence 0^' = (7^, G^\ B\ , . . . , B D ) generated by Al- 
gorithm^ converges to a stationary point of £(j, G, B 1 , . . . , Bd). 

2. (Local Convergence) Let (oo) = (^°°\ G^°°\ B^ , . . . , B { ™ ] ) be a strict local maxi- 
mum of £. The iterates generated by Algorithm^ are locally attracted to 6^°°' for 6^°' 
sufficiently close to 6^°°' . 

4 Statistical Theory 

In this section we study the usual large n asymptotics of the proposed Tucker tensor regres- 
sion. Regularization is treated in the next section for the small or moderate n cases. For 
simplicity, we drop the classical covariate Z in this section, but all the results can be straight- 
forwardly extended to include Z . We also remark that, although the usually limited sample 
size of neuroimging studies makes the large n asymptotics seem irrelevant, we still believe 
such an asymptotic investigation important, for several reasons. First, when the sample size 
n is considerably larger than the effective number of parameters p-r, the asymptotic study 
tells us that the model is consistently estimating the best Tucker structure approximation to 
the full array model in the sense of Kullback-Liebler distance. Second, the explicit formula 
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for score and information are not only useful for asymptotic theory but also for computation, 
while the identifiability issue has to be properly dealt with for the given model. Finally, the 
regular asymptotics can be of practical relevance, for instance, can be useful in a likelihood 
ratio type test in a replication study. 

4.1 Score and Information 

We first derive the score and information for the tensor regression model, which are essential 
for statistical estimation and inference. The following standard calculus notations are used. 
For a scalar function /, V/ is the (column) gradient vector, df = [V/] T is the differential, 
and d 2 f is the Hessian matrix. For a multivariate function g : 1R P *— y IR 9 , Dg G lR pxq 
denotes the Jacobian matrix holding partial derivatives -^-. We start from the Jacobian and 
Hessian of the systematic part r\ = g(/j,) in Q. 



-,R d i s 



Lemma 2. 1. The gradient Vr]{B u ..., B D ) G lR n " fli+E ^' Pdl 

Vr)(G, B 1 ,...,B D ) = [B D ®.--®B 1 J 1 J 2 ■■■ J D ] T (vecX), 
where J d G JRUd=iPd><PdRd ^ the Jacobian 

J d = DB{B d ) = n d {[(B D ® • • • <g> B d+1 ® B d _ x ® - • • <g> B l )G T {rl) } ® I Pd } (5) 

and Ilrf is the {Y[ d =iPd)-by-{Y[ d= iPd) permutation matrix that reorders vecB( d ) to ob- 
tain vecB, i.e., vecB = II d vecB( d y 

2. Let the Hessian d 2 r}{G,B u . . .,B D ) G IR(n^+E d PcA)x(n^+EdP^) fe partitioned 
into four blocks H GG G M^" RdX ^ Rd , h gb = H T B G G IRn^xE d p<A and H BB G 
IRE d PdfidxE d Pdfl d . Then H GG = 0, H GB has entries 

' l (r 1 ,...,r D ),(id,Sd) = *-{r<i=Sd} / , X ji,---,JD J_ 1 Pj d ' ' 

jd=id d'^d 

and H B B has entries 



* l (id,rd),(id'> r d') ~ {&£ d '} 2-c X h,---,3D /-^t 9si,...,s D J_J_ Pj 



j( s d") 

3d" ' 
3d=id,Jd' =i d> s d=rd,s d ,=r d , d"^d,d' 
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Furthermore, Hb,b can be partitioned in D 2 sub-blocks as 



( * 

H 21 



* * 



V 



Hm H 



Dl **D2 



* 




/ 



The elements of sub- block H d d' £ lR PdRdXp d' R d> can b e retrieved from the matrix 

X {M) {B D <g> ■ ■ ■ <g> B d+l <g> B d _! <g> • ■ • O B d , +1 <g> B d /_i ® • - • <8> B^Gj^y 

Hg,b can be partitioned into D sub-blocks as (Hi, . . . , Hd) ■ The sub-block H<i G 
]RTld R dxPdRd ^ as a i mos t pdYld-Rd nonzero entries which can be retrieved from the 
matrix 

X (d) (B D <8> • • • ® B d+1 (8) B d _! (8) • • • ® Bi). 

Let £(B l7 . . . , B D \y, x) = In p(y\x, B 1; . . . , B D ) be the log-density of GLM. Next result 
derives the score function, Hessian, and Fisher information of the Tucker tensor regression 
model. 

Proposition 2. Consider the tensor regression model defined by Oj and Oj. 

1. The score function (or score vector) is 

(v-nWiv)* 



V£(G,B 1 ,...,B D ) 



a" 



-Vrj(G,B x ,...,B D ) 



with Vr)(G, Bi, . . . , Bd) given in LemmaU^ 



2. The Hessian of the log-density I is 



(6) 



H(G,Bi, . . . ,B D ) 

[Av)} 2 (v - »)0"(rj) 

a 2 a" 2 

(y - n)9>(ri) 



Vr)(G, Bi,..., B D )d V (G, Bi,...,B D ) 



d 2 r ] (G,B 1 ,...,B D ), 



(7) 



with d rj defined in Lemma\!2\ 
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3. The Fisher information matrix is 

I(G,Bi,. . . ,B D ) 
= E[-H(G,B 1 ,...,B D )] 
= Var[W(G, B 1 ,..., B D )d£(G, B u . . . , B D )\ 

wm 2 



a 2 



B D ®---®B 1 J 1 ... J D ] T (vecX)(vecX) T [B D ® • • • ® B x J\ ■ ■ ■ J 



D\ 



Remark 2.1: For canonical link, 9 = 7], 9'(rj) = 1, 9"{rj) = 0, and the second term of Hessian 
vanishes. For the classical GLM with linear systematic part (D = 1), d 2 rj(G,Bi, . . . ,Bd) 
is zero and thus the third term of Hessian vanishes. For the classical GLM (.D = l) with 
canonical link, both second and third terms of the Hessian vanish and thus the Hessian is 
non-stochastic, coinciding with the information matrix. 

4.2 Identifiability 

The Tucker decomposition (J3J) is unidentifiable due to the nonsingular transformation inde- 
terminacy. That is 

[G; B 1 ,...,B D j = lGx 1 Oi 1 x---x D O" 1 ; B x O u ..., B d O d \ 

for any nonsingular matrices Od € \W RdXRd . This implies that the number of free parameters 
for a Tucker model is 'YjdPdRd + Wd^-d — ^2d^-d^ with the last term adjusting for nonsingular 
indeterminacy. Therefore the Tucker model is identifiable only in terms of the equivalency 
classes. 

For asymptotic consistency and normality, it is necessary to adopt a specific constrained 
parameterization. It is common to impose the orthonormality constraint on the factor ma- 
trices B J d Bd = lR d , d = 1, . . . , D. However the resulting parameter space is a manifold and 
much harder to deal with. We adopt an alternative parameterization that fixes the entries 
of the first Rd rows of Bd to be ones 

B = {lG;B 1 ,...,B D j:^ r d ) = l,t d = l,...,R d ,d=l,...,D}. 

The formulae for score, Hessian and information in Proposition|2]require changes accordingly. 
The entries in the first R d rows of Bd are fixed at ones and their corresponding entries, 
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rows and columns in score, Hessian and information need to be deleted. Choice of the 
restricted space B is obviously arbitrary, and excludes arrays with any entries in the first 
rows of Bd equal to zeros. However the set of such exceptional arrays has Lebesgue measure 
zero. In specific applications, subject knowledge may suggest alternative restrictions on the 
parameters. 

Given a finite sample size, conditions for global identifiability of parameters are in general 
hard to obtain except in the linear case (D = 1). Local identifiability essentially requires 
linear independence between the "collapsed" vectors [Bd <g> ■ ■ ■ <g> B\ J\ . . . Jo] T veccCj G 

Proposition 3 (Identifiability). Given iid data points {{j)i, Xi), % — 1, . . . , n} from the Tucker 
tensor regression model. Let Bq G B be a parameter point and assume there exists an open 
neighborhood of B in which the information matrix has a constant rank. Then B is locally 
identifiable if and only if 



2 , — ( v ec Xi) (vec Xi) 



i=X 



\B D (8) • • • ® B x Ji . . . J 



D 



J (B ) = [B D ® • • • ® B x Ji . . . J D ] 

is nonsingular. 

4.3 Asymptotics 

The asymptotics for tensor regression follow from those for MLE or M-estimation. The 
key observation is that the nonlinear part of tensor model Q is a degree-/} polynomial of 
parameters and the collection of polynomials {(B , X) , B G B} form a Vapnik-Cervonenkis 



(VC) class. Then the classical uniform convergence theory applies (van der Vaart, 1998). 



For asymptotic normality, we need to establish that the log-likelihood function of tensor 



regression model is quadratic mean differentiable (Lehmann and Romano, 2005). A sketch 
of the proof is given in the Appendix. 

Theorem 1. Assume B G B is (globally) identifiable up to permutation and the array 
covariates Xi are iid from a bounded underlying distribution. 

1. (Consistency) The MLE is consistent, i.e., B n converges to B in probability, in fol- 
lowing models. (1) Normal tensor regression with a compact parameter space Bq C B. 
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(2) Binary tensor regression. (3) Poisson tensor regression with a compact parameter 
space B C B. 

2. (Asymptotic Normality) For an interior point Bo G B with nonsingular information 
matrix I(B ) M) and B n is consistent, y/n(vecB n — vecB ) converges in distribution 
to a normal with mean zero and covariance matrix I~ 1 (B ). 

In practice it is rare that the true regression coefficient B tTUC e ]R Pl x '" x p d is exactly a low 
rank tensor. However the MLE of the rank-i? tensor model converges to the maximizer of 
function M(B) = Ps truc hip B or equivalently PB truo ^(.PB/PBtmJ- m other words, the MLE 
consistently estimates the best approximation (among models in B) of -B trU e in the sense of 
Kullback-Leibler distance. 

5 Regularized Estimation 

Regularization plays a crucial role in neuroimaging analysis for several reasons. First, even 
after substantial dimension reduction by imposing a Tucker structure, the number of pa- 
rameters j>t can still exceed the number of observations n. Second, even when n > px, 
regularization could potentially be useful for stabilizing the estimates and improving the risk 
property. Finally, regularization is an effective way to incorporate prior scientific knowledge 
about brain structures. For instance, it may sometimes be reasonable to impose symmetry 
on the parameters along the coronal plane for MRI images. 

In our context of Tucker regularized regression, there are two possible types of regular- 
izations, one on the core tensor G only, and the other on both G and B^ simultaneously. 
Which regularization to use depends on the practical purpose of a scientific study. In this 
section, we illustrate the regularization on the core tensor, which simultaneously achieves 
sparsity in the number of outer products in Tucker decomposition (|3| and shrinkage. Toward 
that purpose, we propose to maximize the regularized log-likelihood 

t(>y,G,B 1 ,...,B D )- J2 p v(\9n,...,r D \,X), 

ri,...,r D 

where P^(|x|, A) is a scalar penalty function, A is the penalty tuning parameter, and r\ is an 
index for the penalty family. Note that the penalty term above only involves elements of the 
core tensor, and thus regularization on G only. This formulation includes a large class of 
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penalty functions, including power family (Frank and Friedman, 1993), where P^(|x|,A) = 
X\x\ v , rj e (0,2], and in particular lasso ( Tibshiram} 1996) (r] = 1) and ridge (rj = 2); elastic 



net (ZouandHastie, 2005), where P v (\x\, A) = \[(rj - l)a; 2 /2 + (2 - rj)\x\], r] e [1,2]; SCAD 



(Fan and Li 2001), where d/d\x\Pr,(\x\,\) = A {1{| x |<a} + 0M - \x\) + /(rj - 1)\1 { \ X \ >X} }, 



r] > 2; and MC+ penalty ( |Zhang[ |2010[ ), where P^(|a;|,A) = {A|x| - x 2 /(2rj)} l{\ x \ <v x} + 
0.5\ 2 r}l{\ x \> v \}, among many others. 

Two aspects of the proposed regularized Tucker regression, parameter estimation and 
tuning, deserve some discussion. For regularized estimation, it incurs only slight changes in 
Algorithm [I] That is, when updating G, we simply fit a penalized GLM regression problem, 



.,Bg +1, ,G) 



E *.< 



\9n, 



■,rD\i 



A), 



G( m ) = argmax G £( 7 M J Bf +1) , 

ri,...,r D 

for which many software packages exist. Our implementation utilizes an efficient Matlab 
toolbox for sparse regression (Zhou et al. , 2011). Other steps of Algorithm [I] remain un- 
changed. For the regularization to remain legitimate, we constrain the column norms of 
Bj to be one when updating factor matrices B^. For parameter tuning, one can either use 
the general cross validation approach, or employ Bayesian information criterion to tune the 
penalty parameter A. 

6 Numerical Study 

We have carried out intensive numerical experiments to study the finite sample performance 
of the Tucker regression. Our simulations focus on three aspects: first, we demonstrate the 
capacity of the Tucker regression in identifying various shapes of signals; second, we study 
the consistency property of the method by gradually increasing the sample size; third, we 



compare the performance of the Tucker regression with the CP regression of Zhou et al. 



(2013). We also examine a real MRI imaging data to illustrate the Tucker downsizing and 



to further compare the two tensor models. 

6.1 Identification of Various Shapes of Signals 

In our first example, we demonstrate that the proposed Tucker regression model, though 
with substantial reduction in dimension, can manage to identify a range of two dimensional 



1* 



signal shapes with varying ranks. In Figure |2j we list the 2D signals B £ JR 6j * x 6 ^ in the first 
row, along with the estimates by Tucker tensor models in the second to fourth rows with 
orders (1, 1), (2, 2) and (3, 3), respectively Note that, since the orders along both dimensions 
are made equal, the Tucker model is to perform essentially the same as a CP model in this 
example, and the results are presented here for completeness. We will examine differences of 
the two models in later examples. The regular covariate vector Z e IR 5 and image covariate 
X G IR^ X ^ are randomly generated with all elements being independent standard normals. 
The response Y is generated from a normal model with mean /i = ~y J Z+ (B, X) and variance 
var(yu)/10. The vector coefficient 7 = I5, and the coefficient array B is binary, with the 
signal region equal to one and the rest zero. Note that this problem differs from the usual 



edge detection or object recognition in imaging processing (Qiu, 2005, 2007). In our setup, all 
elements of the image X follow the same distribution. The signal region is defined through 
the coefficient matrix B and needs to be inferred from the relation between Y and X after 
adjusting for Z. It is clearly see in Figure [2] that, the Tucker model yields a sound recovery 
of the true signals, even for those of high rank or natural shape, e.g., "disk" and "butterfly". 



We also illustrate in the plot the BIC criterion in Section 2.4 



6.2 Performance with Increasing Sample Size 

In our second example, we continue to employ a similar model as in Figure [2]but with a three 
dimensional image covariate. The dimension of X is set as pi xp 2 xp 3 , with pi = P2 = P3 = 16 
and 32, respectively. The signal array B is generated from a Tucker structure, with the 
elements of core tensor G and the factor matrices -B's all coming from independent standard 
normals. The dimension of the core tensor G is set as R\xR 2 xR 3 , with Ri = R 2 = R3 = 2, 5, 
and 8, respectively. We gradually increase the sample size, starting with an n that is in 
hundred and no smaller than the degrees of freedom of the generating model. We aim to 
achieve two purposes with this example: first, we verify the consistency property of the 
proposed estimator, and second, we gain some practical knowledge about the estimation 
accuracy with different values of the sample size. Figure [3] summarizes the results. It is 
clearly seen that the estimation improves with the increasing sample size. Meanwhile, we 
observe that, unless the core tensor dimension is small, one would require a relatively large 
sample size to achieve a good estimation accuracy. This is not surprising though, considering 
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Figure 2: True and recovered image signals by Tucker regression. The matrix variate has size 
64 by 64 with entries generated as independent standard normals. The regression coefficient 
for each entry is either (white) or 1 (black). The sample size is 1000. TR(r) means estimate 
from the Tucker regression with an r-by-r core tensor. 
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the number of parameters of the model and that regularization is not employed here. The 
proposed tensor regression approach has been primarily designed for imaging studies with a 
reasonably large number of subjects. Recently, a number of such large-scale brain imaging 
studies are emerging. For instance, the Attention Deficit Hyperactivity Disorder Sample 



Initiative (ADHD, 2013) consists of over 900 participants from eight imaging centers with 



both MRI and fMRI images, as well as their clinical information. Another example is the 



Alzheimer's Disease Neuroimaging Initiative (ADNI, 2013) database, which accumulates over 



3,000 participants with MRI, fMRI and genomics data. In addition, regularization discussed 
in Section [5] and the Tucker downsizing in Section 2J3 can both help improve estimation 
given a limited sample size. 

6.3 Comparison of the Tucker and CP Models 

In our third example, we focus on comparison between the Tucker tensor model with the 



CP tensor model of Zhou et al. (2013). We generate a normal response, and the 3D signal 



array B with dimensions pi,P2,P3 and the <i-ranks ri,r2,r^. Here, the ci-rank is defined 
as the column rank of the mode-<i matricization Bu) of B. We set Pi = P2 = Pz = 16 
and 32, and (ri,r 2 ,r 3 ) = (5,3, 3), (8,4,4) and (10,5,5), respectively. The sample size is 
2000. We fit a Tucker model with Rd = ra, and a CP model with R = maxr^, d — 1, 2, 3. 
We report in Table [2] the degrees of freedom of the two models under different setup, as 
well as the root mean squared error (RMSE) out of 100 data replications. It is seen that 
the Tucker model requires a smaller number of free parameters, while it achieves a more 
accurate estimation compared to the CP model. Such advantages come from the flexibility 
of the Tucker decomposition that permits different orders Rd along directions. 

6.4 Attention Deficit Hyperactivity Disorder Data Analysis 

We analyze the attention deficit hyperactivity disorder (ADHD) data from the ADHD-200 



Sample Initiative (ADHD, 2013) to illustrate our proposed method as well as the Tucker 



downsizing. ADHD is a common childhood disorder and can continue through adolescence 
and adulthood. Symptoms include difficulty in staying focused and paying attention, dif- 
ficulty in controlling behavior, and over-activity The data set that we analyzed is part 
of the ADHD-200 Global Competition data sets. It was pre-partitioned into a training 
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Figure 3: Root mean squared error (RMSE) of the tensor parameter estimate versus the 

sample size. Reported are the average and standard deviation of RMSE based on 100 

data replications. Top: R\ = R2 = R3 = 2; Middle: Ri = R 2 = R3 = 5; Bottom: 
Ri = i?2 = R3 — 8. 
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Table 2: Comparison of the Tucker and CP models. Reported are the average and stan- 
dard deviation (in the parenthesis) of the root mean squared error, all based on 100 data 
replications. 



Dimension Criterion 


Model 


(5,3,3) 


(8,4,4) 


(10,5,5) 


16 x 16 x 16 Df 


Tucker 


178 


288 


420 




CP 


230 


368 


460 


RMSE 


Tucker 


0.202 (0.013) 


0.379 (0.017) 


0.728 (0.030) 




CP 


0.287 (0.033) 


1.030 (0.081) 


2.858 (0.133) 


32 x 32 x 32 Df 


Tucker 


354 


544 


740 




CP 


470 


752 


940 


RMSE 


Tucker 


0.288 (0.013) 


0.570 (0.023) 


1.236 (0.045) 




CP 


0.392 (0.046) 


1.927 (0.172) 


16.238 (3.867) 



data of 770 subjects and a testing data of 197 subjects. We removed those subjects with 
missing observations or poor image quality, resulting in 762 training subjects and 169 test- 
ing subjects. In the training set, there were 280 combined ADHD subjects, 482 normal 
controls, and the case-control ratio is about 3:5. In the testing set, there were 76 com- 
bined ADHD subjects, 93 normal controls, and the case-control ratio is about 4:5. Tl- 
weighted images were acquired for each subject, and were preprocessed by standard steps. 
The data we used is obtained from the Neuro Bureau after preprocessing (the Burner data, 
http://neurobureau.projects.nitrc.org/ADHD200/Data.html). In addition to the MRI image 
predictor, we also include the subjects' age and handiness as regular covariates. The re- 
sponse is the binary diagnosis status. 

The original image size was p\ x p 2 x p 3 = 121 x 145 x 121. We employ the Tucker 
downsizing in Section 2J3 More specifically, we first choose a wavelet basis for B& G ]R PdX ^ d , 
then transform the image predictor from X to X = [X; B\, . . . , BJjJ. We pre-specify the 
values of p^s that are about tenth of the original dimensions p^, and equivalently, we fit a 
Tucker tensor regression with the image predictor dimension downsized to p\ xp 2 xp 3 . In our 
example, we have experimented with a set of values of pd's, and the results are qualitatively 
similar. We report two sets, p\ = 12, p 2 = 14, p 3 = 12, and p\ = 10, p 2 = 12, p 3 = 10. We 
have also experimented with the Haar wavelet basis (Daubechies D2) and the Daubechies 
D4 wavelet basis, which again show similar qualitative patterns. 

For p 1 = 12, p 2 = 14, p 3 = 12, we fit a Tucker tensor model with R 1 = R 2 = R 3 = 3, 
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Table 3: ADHD testing data misclassification error. 



Basis 


Reduced dimension 


Reg- Tucker 


Reg-CP 


Tucker 


CP 


Haar (D2) 


12 x 14 x 12 
10 x 12 x 10 


0.361 
0.343 


0.367 
0.390 


0.379 
0.379 


0.438 
0.408 


Daubechies (D4) 


12 x 14 x 12 
10 x 12 x 10 


0.337 
0.320 


0.385 
0.396 


0.385 
0.367 


0.414 
0.373 



resulting in 114 free parameters, and fit a CP tensor model with R = 4, resulting in 144 free 
parameters. For p% = 10, P2 = 12, p% = 10, we fit a Tucker tensor model with R± = R2 = 2 
and -R3 = 3, resulting in 71 free parameters, and fit a CP tensor model with R = 4, resulting 
in 120 free parameters. We have chosen those orders based on the following considerations. 
First, the number of free parameters of the Tucker and CP models are comparable. Second, 
at each step of GLM model fit, we ensure that the ratio between the sample size n and 
the number of parameters under estimation in that step pd x Rd satisfies a heuristic rule of 
greater than two in normal models and greater than five in logistic models. In the Tucker 
model, we also ensure the ratio between n and the number of parameters in the core tensor 
estimation \\ d R d satisfies this rule. We note that this selection of Tucker orders is heuristic; 
however, it seems to be a useful guideline especially when the data is noisy. We also fit 
a regularized Tucker model and a regularized CP model with the same orders, while the 
penalty parameter is tuned based on 5-fold cross validation of the training data. 

We evaluate each model by comparing the misclassification error rate on the independent 
testing set. The results are shown in Table [3j We see from the table that, the regularized 
Tucker model performs the best, which echoes the findings in our simulations above. We 
also remark that, considering the fact that the ratio of case-control is about 4:5 in the 
testing data, the misclassification rate from 0.32 to 0.36 achieved by the regularized Tucker 
model indicates a fairly sound classification accuracy. On the other hand, we note that, 
a key advantage of our proposed approach is its capability of suggesting a useful model 
rather than the classification accuracy per se. This is different from black-box type machine 
learning based imaging classifiers. 

It is also of interest to compare the run times of the two tensor model fittings. We record 
the run times of fitting the Tucker and CP models with the ADHD training data in Table 
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Table 4: ADHD model fitting run time (in seconds). 



Basis 


Reduced dimension 


Reg- Tucker 


Reg-CP 


Tucker 


CP 


Haar (D2) 


12 x 14 x 12 
10 x 12 x 10 


3.68 
1.36 


4.39 
2.79 


31.25 
9.08 


22.43 
25.10 


Daubechies (D4) 


12 x 14 x 12 
10 x 12 x 10 


3.30 
1.92 


2.18 
1.90 


16.87 
9.96 


26.34 
17.10 



|4| They are comparable. 



7 Discussion 



We have proposed a tensor regression model based on the Tucker decomposition. Including 



the CP tensor regression (Zhou et al. 5 2013) as a special case, Tucker model provides a more 



flexible framework for regression with imaging covariates. We develop a fast estimation 
algorithm, a general regularization procedure, and the associated asymptotic properties. In 
addition, we provide a detailed comparison, both analytically and numerically, of the Tucker 
and CP tensor models. 

In real imaging analysis, the signal hardly has an exact low rank. On the other hand, 
given the limited sample size, a low rank estimate often provides a reasonable approximation 
to the true signal. This is why the low rank models such as the Tucker and CP could offer 
a sound recovery of even a complex signal. 

The tensor regression framework established in this article is general enough to encompass 
a large number of potential extensions, including but not limited to imaging multi-modality 
analysis, imaging classification, and longitudinal imaging analysis. These extensions consist 
of our future research. 
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